Chapter 6 Part 2

Shrinkage


Tyler George

Cornell College
STA 362 Spring 2024 Block 8

Setup

library(tidyverse)
library(tidymodels)
library(gridExtra)
library(ISLR2)
library(leaps)

Shrinkage Methods

Ridge regression and the lasso - The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.

  • As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.

  • It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

Another Reason

  • Sometimes we can’t solve for \(\hat\beta\)
  • Why?
  • We have more variables than observations ( \(p > n\) )
  • The variables are linear combinations of one another (perfect collinearity)
  • The variance can blow up
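A quick sketch of the \(p > n\) problem with simulated data (the dimensions here are illustrative, not from the slides): when there are more predictors than observations, least squares has no unique solution, and `lm()` signals this by returning `NA` for the coefficients it cannot estimate.

```r
set.seed(1)
n <- 10; p <- 20                                 # more predictors than observations
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
dat$y <- rnorm(n)

fit <- lm(y ~ ., data = dat)
# lm() can only estimate as many coefficients as it has observations;
# the remaining ones come back as NA
sum(is.na(coef(fit)))
```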

What can we do about this?

Ridge Regression

  • What if we add an additional penalty to keep the \(\hat\beta\) coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing \(RSS\), like we do with linear regression, let’s minimize \(RSS\) PLUS some penalty function
  • \(RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}\)
  • What happens when \(\lambda=0\)? What happens as \(\lambda\rightarrow\infty\)?

Ridge Regression

  • Recall, the least squares fitting procedure estimates \(\beta_0,...,\beta_p\) using the values that minimize \[RSS = \sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2\]
  • Ridge regression coefficient estimates, \(\hat{\beta}^R\) are the values that minimize

\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p\beta_j^2\]

\[ = RSS + \lambda\sum_{j=1}^p\beta_j^2\]

where \(\lambda\geq 0\) is a tuning parameter, to be determined separately
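The objective above can be fit directly with the tidymodels stack loaded in Setup. A minimal sketch using the `Hitters` data from ISLR2 (the penalty value is an arbitrary illustration, not a tuned choice); in the glmnet engine, `mixture = 0` selects the pure ridge penalty.

```r
# Hitters has missing salaries; drop them before fitting
hitters <- Hitters |> drop_na(Salary)

ridge_spec <- linear_reg(penalty = 10, mixture = 0) |>
  set_engine("glmnet")

ridge_fit <- ridge_spec |>
  fit(Salary ~ ., data = hitters)

# All p coefficients are kept, but shrunk toward (not exactly to) zero
tidy(ridge_fit)
```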

More on Ridge

  • Like least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small.

  • The second term, \(\lambda\sum_j\beta_j^2\), is called a shrinkage penalty; it is small when \(\beta_1,...,\beta_p\) are close to 0, and so it has the effect of shrinking the estimates of \(\beta_j\) toward 0.

Shrinkage

Each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\).

Shrinkage Coefficients

  • This displays the same ridge coefficient estimates as the previous graphs, but instead of displaying \(\lambda\) on the x-axis, we now display \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\), where \(\hat{\beta}\) denotes the vector of the least squares coefficient estimates.

  • The notation \(||\beta||_2\) denotes the \(\ell_2\) norm of a vector, defined as \(||\beta||_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\); it measures the distance of \(\beta\) from zero.

Ridge - Scaling Predictors

  • The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\). In other words, regardless of how the jth predictor is scaled, \(X_j\hat{\beta}_j\) will remain the same.

  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.

  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

  • \[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]

Ridge Regression

  • IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)
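In the tidymodels workflow, standardization can be done with a recipe step before the model sees the data. A minimal sketch on the `Hitters` data from ISLR2 (the dummy step is an assumption made so that the normalize step applies to all predictors):

```r
hitters <- Hitters |> drop_na(Salary)

ridge_rec <- recipe(Salary ~ ., data = hitters) |>
  step_dummy(all_nominal_predictors()) |>     # encode factors numerically first
  step_normalize(all_numeric_predictors())    # center and scale each predictor
```

Note that glmnet also standardizes internally by default, but making the step explicit in the recipe keeps the preprocessing visible and reproducible.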

Choosing \(\lambda\)

  • \(\lambda\) is known as a tuning parameter and is selected using cross validation
  • For example, choose the \(\lambda\) that results in the smallest estimated test error
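A sketch of this tuning loop with `tune_grid()`, again on the `Hitters` data from ISLR2; the fold count, penalty range, and grid size are illustrative choices, not prescriptions.

```r
set.seed(2024)
hitters <- Hitters |> drop_na(Salary)
folds <- vfold_cv(hitters, v = 10)            # 10-fold cross validation

ridge_spec <- linear_reg(penalty = tune(), mixture = 0) |>
  set_engine("glmnet")

ridge_wf <- workflow() |>
  add_model(ridge_spec) |>
  add_formula(Salary ~ .)

# Evaluate a grid of lambda values on the resamples
ridge_res <- tune_grid(
  ridge_wf,
  resamples = folds,
  grid = grid_regular(penalty(range = c(-3, 3)), levels = 50)
)

# The lambda with the smallest estimated test error
select_best(ridge_res, metric = "rmse")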

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As \(\lambda\) ☝️, bias ☝️, variance 👇

Ridge Bias-variance tradeoff

  • Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\) and \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.

Lasso

  • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.

  • The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta}_\lambda^L\), minimize the quantity

\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p|\beta_j|\]

\[ = RSS + \lambda\sum_{j=1}^p|\beta_j|\]

where \(\lambda\geq 0\) is a tuning parameter, to be determined separately

  • In statistics lingo, the lasso uses an \(\ell_1\) (pronounced “ell 1”) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1 = \sum|\beta_j|\)
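In code, the lasso is the same glmnet spec with `mixture = 1`. A minimal sketch on the `Hitters` data from ISLR2 (the penalty value is an arbitrary illustration): with a large enough \(\lambda\), the \(\ell_1\) penalty forces some coefficients to be exactly zero, performing variable selection.

```r
hitters <- Hitters |> drop_na(Salary)

lasso_fit <- linear_reg(penalty = 50, mixture = 1) |>
  set_engine("glmnet") |>
  fit(Salary ~ ., data = hitters)

# Unlike ridge, the lasso can zero out coefficients entirely;
# these terms are effectively dropped from the model
tidy(lasso_fit) |> filter(estimate == 0)
```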

to be continued